{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Chapter 4. Naive Bayes and Sentiment Classification"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.1\n",
    "\n",
    "Assume the following likelihoods for each word being part of a positive or negative movie review, and equal prior probabilities for each class.\n",
    "\n",
    "What class will naive Bayes assign to the sentence “I always like foreign films.”?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pos</th>\n",
       "      <th>neg</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>word</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>I</th>\n",
       "      <td>0.09</td>\n",
       "      <td>0.16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>always</th>\n",
       "      <td>0.07</td>\n",
       "      <td>0.06</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>like</th>\n",
       "      <td>0.29</td>\n",
       "      <td>0.06</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>foreign</th>\n",
       "      <td>0.04</td>\n",
       "      <td>0.15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>films</th>\n",
       "      <td>0.08</td>\n",
       "      <td>0.11</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          pos   neg\n",
       "word               \n",
       "I        0.09  0.16\n",
       "always   0.07  0.06\n",
       "like     0.29  0.06\n",
       "foreign  0.04  0.15\n",
       "films    0.08  0.11"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a = 'I 0.09 0.16 always 0.07 0.06 like 0.29 0.06 foreign 0.04 0.15 films 0.08 0.11'.split()\n",
    "b = pd.DataFrame(a).values.reshape([-1,3])\n",
    "df = pd.DataFrame(b, columns=['word', 'pos', 'neg']).set_index('word')\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "category: neg, P(pos)=2.92e-06, P(neg)=4.75e-06\n"
     ]
    }
   ],
   "source": [
    "p_pos = 0.5\n",
    "p_neg = 0.5\n",
    "for index, row in df.iterrows():\n",
    "    p_pos *= float(row['pos'])\n",
    "    p_neg *= float(row['neg'])\n",
    "\n",
    "cate = 'pos' if p_pos > p_neg else 'neg'\n",
    "print('category: %s, P(pos)=%.2e, P(neg)=%.2e' % (cate, p_pos, p_neg))"
   ]
  },
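  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a hand check of the cell above, the same computation can be written out. This is a sketch that assumes the usual naive Bayes decision rule and the equal priors $P(c) = 0.5$ stated in the exercise; the symbol $\\hat{c}$ for the predicted class is notation introduced here.\n",
    "\n",
    "$$\\hat{c} = \\arg\\max_{c} \\; P(c) \\prod_{i} P(w_i \\mid c)$$\n",
    "\n",
    "$$P(\\text{pos}) \\prod_i P(w_i \\mid \\text{pos}) = 0.5 \\times 0.09 \\times 0.07 \\times 0.29 \\times 0.04 \\times 0.08 \\approx 2.92 \\times 10^{-6}$$\n",
    "\n",
    "$$P(\\text{neg}) \\prod_i P(w_i \\mid \\text{neg}) = 0.5 \\times 0.16 \\times 0.06 \\times 0.06 \\times 0.15 \\times 0.11 \\approx 4.75 \\times 10^{-6}$$\n",
    "\n",
    "The second score is larger, so the rule selects `neg`. Both values are unnormalized scores rather than posterior probabilities; only their relative size matters for the decision."
   ]
  },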
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "pos: 1.04e-03\n",
      "neg: 1.44e-03\n"
     ]
    }
   ],
   "source": [
    "# Consider only three words: I, like, foreign\n",
    "# 'I' and 'foreign' are the main words pulling the score toward neg\n",
    "print('pos: %.2e' % (0.09 * 0.29 * 0.04))\n",
    "print('neg: %.2e' % (0.16 * 0.06 * 0.15))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Notes\n",
    "\n",
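    "The classifier assigns `neg`, although the sentence reads as positive. The strong evidence from *like* (0.29 vs. 0.06) is outweighed by *I*, *foreign*, and *films*, all of which are more likely under `neg`."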
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.2\n",
    "\n",
    "Given the following short movie reviews, each labeled with a genre, either comedy or action:\n",
    "\n",
    "1. fun, couple, love, love comedy\n",
    "2. fast, furious, shoot action\n",
    "3. couple, fly, fast, fun, fun comedy\n",
    "4. furious, shoot, shoot, fun action\n",
    "5. fly, fast, shoot, love action\n",
    "\n",
    "and a new document D:\n",
    "\n",
    "fast, couple, shoot, fly\n",
    "\n",
    "Compute the most likely class for D. Assume a naive Bayes classifier and use add-1 smoothing for the likelihoods."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['fun', 'couple', 'love', 'love', 'comedy'],\n",
       " ['fast', 'furious', 'shoot', 'action'],\n",
       " ['couple', 'fly', 'fast', 'fun', 'fun', 'comedy'],\n",
       " ['furious', 'shoot', 'shoot', 'fun', 'action'],\n",
       " ['fly', 'fast', 'shoot', 'love', 'action']]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a = '''1. fun, couple, love, love comedy\n",
    "2. fast, furious, shoot action\n",
    "3. couple, fly, fast, fun, fun comedy\n",
    "4. furious, shoot, shoot, fun action\n",
    "5. fly, fast, shoot, love action'''\n",
    "docs = [line.replace(',', '').split()[1:] for line in a.split('\\n')]\n",
    "docs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'comedy': 0.4, 'action': 0.6}\n",
      "{'comedy': {'fun': 3, 'couple': 2, 'love': 2, 'fly': 1, 'fast': 1}, 'action': {'fast': 2, 'furious': 2, 'shoot': 4, 'fun': 1, 'fly': 1, 'love': 1}}\n",
      "{'love', 'fun', 'shoot', 'couple', 'fast', 'fly', 'furious'}\n"
     ]
    }
   ],
   "source": [
    "label_cnt = {}\n",
    "word_cnt = {}\n",
    "corpus = set()\n",
    "\n",
    "for row in docs:\n",
    "    label = row[-1]\n",
    "    if label not in label_cnt:\n",
    "        label_cnt[label] = 1\n",
    "        word_cnt[label] = {}\n",
    "    else:\n",
    "        label_cnt[label] += 1\n",
    "\n",
    "    label_word_cnt = word_cnt[label]\n",
    "    for word in row[:-1]:\n",
    "        corpus.add(word)\n",
    "        if word not in label_word_cnt:\n",
    "            label_word_cnt[word] = 1\n",
    "        else:\n",
    "            label_word_cnt[word] +=1\n",
    "\n",
    "doc_cnt = len(docs)\n",
    "p_label = {label: 1.0 * cnt / doc_cnt for label, cnt in label_cnt.items()}\n",
    "print(p_label)\n",
    "print(word_cnt)\n",
    "print(corpus)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "verify the sum of likely hood in each label:\n",
      "comedy    1.0\n",
      "action    1.0\n",
      "dtype: float64\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>comedy</th>\n",
       "      <th>action</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>couple</th>\n",
       "      <td>0.1875</td>\n",
       "      <td>0.055556</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>fast</th>\n",
       "      <td>0.1250</td>\n",
       "      <td>0.166667</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>fly</th>\n",
       "      <td>0.1250</td>\n",
       "      <td>0.111111</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>fun</th>\n",
       "      <td>0.2500</td>\n",
       "      <td>0.111111</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>furious</th>\n",
       "      <td>0.0625</td>\n",
       "      <td>0.166667</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>love</th>\n",
       "      <td>0.1875</td>\n",
       "      <td>0.111111</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>shoot</th>\n",
       "      <td>0.0625</td>\n",
       "      <td>0.277778</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         comedy    action\n",
       "couple   0.1875  0.055556\n",
       "fast     0.1250  0.166667\n",
       "fly      0.1250  0.111111\n",
       "fun      0.2500  0.111111\n",
       "furious  0.0625  0.166667\n",
       "love     0.1875  0.111111\n",
       "shoot    0.0625  0.277778"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "likely_hoods = {}\n",
    "V = len(corpus)\n",
    "for label, cnts in word_cnt.items():\n",
    "    likely_hood = {}\n",
    "    total = sum(cnts.values()) + V\n",
    "    \n",
    "    for w in corpus:\n",
    "        likely_hood[w] = 1.0 * (cnts.get(w, 0) + 1) / total\n",
    "    likely_hoods[label] = likely_hood\n",
    "\n",
    "df=pd.DataFrame(likely_hoods)\n",
    "print('verify the sum of likely hood in each label:')\n",
    "print(df.sum())\n",
    "df"
   ]
  },
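  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The table above follows from add-1 (Laplace) smoothing. A sketch of the formula, where $\\text{count}(w, c)$ is notation introduced here for the number of times word $w$ occurs in class $c$, and $|V| = 7$ is the vocabulary size (the `corpus` set in the code):\n",
    "\n",
    "$$P(w \\mid c) = \\frac{\\text{count}(w, c) + 1}{\\sum_{w'} \\text{count}(w', c) + |V|}$$\n",
    "\n",
    "Comedy has 9 tokens and action has 11, so the denominators are $9 + 7 = 16$ and $11 + 7 = 18$. Two worked entries:\n",
    "\n",
    "$$P(\\text{fast} \\mid \\text{comedy}) = \\frac{1 + 1}{16} = 0.125, \\qquad P(\\text{shoot} \\mid \\text{action}) = \\frac{4 + 1}{18} \\approx 0.2778$$\n",
    "\n",
    "A word that never occurs in a class, such as *shoot* in comedy, still gets $\\frac{0 + 1}{16} = 0.0625$ rather than zero."
   ]
  },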
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['fast', 'couple', 'shoot', 'fly']"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc = 'fast, couple, shoot, fly'.replace(',', '').split()\n",
    "doc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "prob of comedy: 7.32e-05\n",
      "prob of action: 1.71e-04\n"
     ]
    }
   ],
   "source": [
    "for label, likely in likely_hoods.items():\n",
    "    p_i = p_label[label]\n",
    "    for w in doc:\n",
    "        p_i *= likely[w]\n",
    "    print('prob of %s: %.2e' % (label, p_i))"
   ]
  },
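  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Written out by hand, the two scores printed above use the priors $P(\\text{comedy}) = 2/5$ and $P(\\text{action}) = 3/5$ and the smoothed likelihoods of *fast*, *couple*, *shoot*, and *fly*, in that order. This is a check of the arithmetic, not a different method:\n",
    "\n",
    "$$P(\\text{comedy}) \\prod_{w \\in D} P(w \\mid \\text{comedy}) = \\frac{2}{5} \\times \\frac{2}{16} \\times \\frac{3}{16} \\times \\frac{1}{16} \\times \\frac{2}{16} \\approx 7.32 \\times 10^{-5}$$\n",
    "\n",
    "$$P(\\text{action}) \\prod_{w \\in D} P(w \\mid \\text{action}) = \\frac{3}{5} \\times \\frac{3}{18} \\times \\frac{1}{18} \\times \\frac{5}{18} \\times \\frac{2}{18} \\approx 1.71 \\times 10^{-4}$$\n",
    "\n",
    "The action score is roughly 2.3 times the comedy score. The largest single contrast is *shoot*: $5/18$ under action versus $1/16$ under comedy."
   ]
  },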
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "D is classified as action; *shoot* is the key discriminating word."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.3\n",
    "\n",
    "Train two models, multinomial naive Bayes and binarized naive Bayes, both with add-1 smoothing, on the following document counts for key sentiment words, with positive or negative class assigned as noted.\n",
    "\n",
    "Use both naive Bayes models to assign a class (pos or neg) to this sentence:\n",
    "\n",
    "A good, good plot and great characters, but poor acting.\n",
    "\n",
    "Do the two models agree or disagree?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>good</th>\n",
       "      <th>poor</th>\n",
       "      <th>great</th>\n",
       "      <th>class</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>doc</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>d1</th>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>pos</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>d2</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>pos</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>d3</th>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>neg</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>d4</th>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "      <td>neg</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>d5</th>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>neg</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    good poor great class\n",
       "doc                      \n",
       "d1     3    0     3   pos\n",
       "d2     0    1     2   pos\n",
       "d3     1    3     0   neg\n",
       "d4     1    5     2   neg\n",
       "d5     0    2     0   neg"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a = 'd1.3 0 3 pos d2.0 1 2 pos d3.1 3 0 neg d4.1 5 2 neg d5.0 2 0 neg'.replace('.', ' ').split()\n",
    "b = pd.DataFrame(a).values.reshape([-1, 5])\n",
    "df = pd.DataFrame(b, columns='doc good poor great class'.split()).set_index('doc')\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['a',\n",
       " 'good',\n",
       " 'good',\n",
       " 'plot',\n",
       " 'and',\n",
       " 'great',\n",
       " 'characters',\n",
       " 'but',\n",
       " 'poor',\n",
       " 'acting']"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc = 'A good, good plot and great characters, but poor acting'.lower().replace(',', '').split()\n",
    "doc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Multinomial NB"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'pos': 0.4, 'neg': 0.6}\n",
      "{'pos': {'good': 3, 'poor': 1, 'great': 5}, 'neg': {'good': 2, 'poor': 10, 'great': 2}}\n",
      "['good', 'poor', 'great']\n"
     ]
    }
   ],
   "source": [
    "label_cnt = {}\n",
    "word_cnt = {}\n",
    "\n",
    "corpus = 'good poor great'.split()\n",
    "\n",
    "for index, row in df.iterrows():\n",
    "    label = row['class']\n",
    "    if label not in label_cnt:\n",
    "        label_cnt[label] = 1\n",
    "        word_cnt[label] = {}\n",
    "    else:\n",
    "        label_cnt[label] += 1\n",
    "\n",
    "    label_word_cnt = word_cnt[label]\n",
    "    for word, cnt in zip(corpus, row[:-1]):\n",
    "        if word not in label_word_cnt:\n",
    "            label_word_cnt[word] = int(cnt)\n",
    "        else:\n",
    "            label_word_cnt[word] += int(cnt)\n",
    "\n",
    "doc_cnt = len(df)\n",
    "p_label = {label: 1.0 * cnt / doc_cnt for label, cnt in label_cnt.items()}\n",
    "print(p_label)\n",
    "print(word_cnt)\n",
    "print(corpus)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "verify the sum of likely hood in each label:\n",
      "pos    1.0\n",
      "neg    1.0\n",
      "dtype: float64\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pos</th>\n",
       "      <th>neg</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>good</th>\n",
       "      <td>0.333333</td>\n",
       "      <td>0.176471</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>great</th>\n",
       "      <td>0.500000</td>\n",
       "      <td>0.176471</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>poor</th>\n",
       "      <td>0.166667</td>\n",
       "      <td>0.647059</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            pos       neg\n",
       "good   0.333333  0.176471\n",
       "great  0.500000  0.176471\n",
       "poor   0.166667  0.647059"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "likely_hoods = {}\n",
    "V = len(corpus)\n",
    "for label, cnts in word_cnt.items():\n",
    "    likely_hood = {}\n",
    "    total = sum(cnts.values()) + V\n",
    "    \n",
    "    for w in corpus:\n",
    "        likely_hood[w] = 1.0 * (cnts.get(w, 0) + 1) / total\n",
    "    likely_hoods[label] = likely_hood\n",
    "\n",
    "lh = pd.DataFrame(likely_hoods)\n",
    "print('verify the sum of likely hood in each label:')\n",
    "print(lh.sum())\n",
    "lh"
   ]
  },
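  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The multinomial likelihoods above can be checked by hand. Summing the document counts per class gives pos: good 3, poor 1, great 5 (9 tokens) and neg: good 2, poor 10, great 2 (14 tokens). Assuming the vocabulary is just the three key words, $|V| = 3$, add-1 smoothing gives denominators $9 + 3 = 12$ and $14 + 3 = 17$:\n",
    "\n",
    "$$P(\\text{good} \\mid \\text{pos}) = \\frac{3 + 1}{12}, \\quad P(\\text{poor} \\mid \\text{pos}) = \\frac{1 + 1}{12}, \\quad P(\\text{great} \\mid \\text{pos}) = \\frac{5 + 1}{12}$$\n",
    "\n",
    "$$P(\\text{good} \\mid \\text{neg}) = \\frac{2 + 1}{17}, \\quad P(\\text{poor} \\mid \\text{neg}) = \\frac{10 + 1}{17}, \\quad P(\\text{great} \\mid \\text{neg}) = \\frac{2 + 1}{17}$$"
   ]
  },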
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "prob of pos: 3.70e-03\n",
      "prob of neg: 2.13e-03\n"
     ]
    }
   ],
   "source": [
    "for label, likely in likely_hoods.items():\n",
    "    p_i = p_label[label]\n",
    "    for w in doc:\n",
    "        p_i *= likely.get(w, 1)\n",
    "    print('prob of %s: %.2e' % (label, p_i))"
   ]
  },
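  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A hand check of the two scores above. Only *good* (twice), *great*, and *poor* are in the vocabulary; the code skips every other word in the sentence by giving it a factor of 1. With priors $P(\\text{pos}) = 2/5$ and $P(\\text{neg}) = 3/5$:\n",
    "\n",
    "$$P(\\text{pos}) \\prod_i P(w_i \\mid \\text{pos}) = \\frac{2}{5} \\times \\left(\\frac{4}{12}\\right)^{2} \\times \\frac{6}{12} \\times \\frac{2}{12} \\approx 3.70 \\times 10^{-3}$$\n",
    "\n",
    "$$P(\\text{neg}) \\prod_i P(w_i \\mid \\text{neg}) = \\frac{3}{5} \\times \\left(\\frac{3}{17}\\right)^{2} \\times \\frac{3}{17} \\times \\frac{11}{17} \\approx 2.13 \\times 10^{-3}$$\n",
    "\n",
    "The squared term is the repeated *good*, which is where the multinomial model gains its extra positive weight."
   ]
  },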
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Notes\n",
    "\n",
    "The multinomial model assigns `pos`, which is questionable given the closing *but poor acting*: the repeated *good* is counted twice, so the positive words carry too much weight."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### binarized NB"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'pos': 0.4, 'neg': 0.6}\n",
      "{'pos': {'good': 1, 'poor': 1, 'great': 2}, 'neg': {'good': 2, 'poor': 3, 'great': 1}}\n",
      "['good', 'poor', 'great']\n"
     ]
    }
   ],
   "source": [
    "label_cnt = {}\n",
    "word_cnt = {}\n",
    "\n",
    "corpus = 'good poor great'.split()\n",
    "\n",
    "for index, row in df.iterrows():\n",
    "    label = row['class']\n",
    "    if label not in label_cnt:\n",
    "        label_cnt[label] = 1\n",
    "        word_cnt[label] = {}\n",
    "    else:\n",
    "        label_cnt[label] += 1\n",
    "\n",
    "    label_word_cnt = word_cnt[label]\n",
    "    for word, cnt in zip(corpus, row[:-1]):\n",
    "        if word not in label_word_cnt:\n",
    "            label_word_cnt[word] = int(cnt != '0')\n",
    "        else:\n",
    "            label_word_cnt[word] += int(cnt != '0')\n",
    "\n",
    "doc_cnt = len(df)\n",
    "p_label = {label: 1.0 * cnt / doc_cnt for label, cnt in label_cnt.items()}\n",
    "print(p_label)\n",
    "print(word_cnt)\n",
    "print(corpus)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "verify the sum of likely hood in each label:\n",
      "pos    1.0\n",
      "neg    1.0\n",
      "dtype: float64\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pos</th>\n",
       "      <th>neg</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>good</th>\n",
       "      <td>0.285714</td>\n",
       "      <td>0.333333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>great</th>\n",
       "      <td>0.428571</td>\n",
       "      <td>0.222222</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>poor</th>\n",
       "      <td>0.285714</td>\n",
       "      <td>0.444444</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            pos       neg\n",
       "good   0.285714  0.333333\n",
       "great  0.428571  0.222222\n",
       "poor   0.285714  0.444444"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "likely_hoods = {}\n",
    "V = len(corpus)\n",
    "for label, cnts in word_cnt.items():\n",
    "    likely_hood = {}\n",
    "    total = sum(cnts.values()) + V\n",
    "    \n",
    "    for w in corpus:\n",
    "        likely_hood[w] = 1.0 * (cnts.get(w, 0) + 1) / total\n",
    "    likely_hoods[label] = likely_hood\n",
    "\n",
    "lh = pd.DataFrame(likely_hoods)\n",
    "print('verify the sum of likely hood in each label:')\n",
    "print(lh.sum())\n",
    "lh"
   ]
  },
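  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The binarized likelihoods above can be checked by hand. Each document contributes at most one count per word, so the class counts become document frequencies: pos has good 1, poor 1, great 2 (total 4), and neg has good 2, poor 3, great 1 (total 6). Assuming the same three-word vocabulary, $|V| = 3$, the denominators are $4 + 3 = 7$ and $6 + 3 = 9$:\n",
    "\n",
    "$$P(\\text{good} \\mid \\text{pos}) = \\frac{1 + 1}{7}, \\quad P(\\text{poor} \\mid \\text{pos}) = \\frac{1 + 1}{7}, \\quad P(\\text{great} \\mid \\text{pos}) = \\frac{2 + 1}{7}$$\n",
    "\n",
    "$$P(\\text{good} \\mid \\text{neg}) = \\frac{2 + 1}{9}, \\quad P(\\text{poor} \\mid \\text{neg}) = \\frac{3 + 1}{9}, \\quad P(\\text{great} \\mid \\text{neg}) = \\frac{1 + 1}{9}$$\n",
    "\n",
    "Compared with the multinomial table, *good* now favors neg ($3/9$ vs. $2/7$), because it appears in two negative documents but only one positive document."
   ]
  },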
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "prob of pos: 1.40e-02\n",
      "prob of neg: 1.98e-02\n"
     ]
    }
   ],
   "source": [
    "for label, likely in likely_hoods.items():\n",
    "    p_i = p_label[label]\n",
    "    # Binarized NB also clips counts in the test document, so each distinct word is used once\n",
    "    for w in set(doc):\n",
    "        p_i *= likely.get(w, 1)\n",
    "    print('prob of %s: %.2e' % (label, p_i))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Notes\n",
    "\n",
    "The output is `neg`, as expected. The improvement comes from clipping word counts at 1 per document, which reduces the weight of the repeated positive words."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Summary\n",
    "\n",
    "The two models disagree: multinomial NB assigns `pos`, while binarized NB assigns `neg`.\n",
    "\n",
    "Binarized NB gives the better result for this sentiment classification example."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
